Enable computation of CUDA global kernels derivative in reverse mode #1059
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
- Coverage   94.40%   94.38%    -0.02%
==========================================
  Files          55       55
  Lines        8430     8445       +15
==========================================
+ Hits         7958     7971       +13
- Misses        472      474        +2
clang-tidy made some suggestions
@@ -210,10 +235,34 @@
      printf("CladFunction is invalid\n");
      return static_cast<return_type_t<F>>(return_type_t<F>());
    }
    if (m_CUDAkernel) {
      printf("Use execute_kernel() for global CUDA kernels\n");
warning: do not call c-style vararg functions [cppcoreguidelines-pro-type-vararg]
printf("Use execute_kernel() for global CUDA kernels\n");
^
lib/Differentiator/DiffPlanner.cpp
    int numArgs = static_cast<int>(call->getNumArgs());
    if (numArgs > 4) {
      auto kernelArgIdx = numArgs - 1;
      auto cudaKernelFlag = new (C) CXXBoolLiteralExpr(
warning: initializing non-owner 'CXXBoolLiteralExpr *' with a newly created 'gsl::owner<>' [cppcoreguidelines-owning-memory]
auto cudaKernelFlag = new (C) CXXBoolLiteralExpr(
^
clang-tidy made some suggestions
I have added such a test in
I think Differentiator.h is difficult to test right now. Let's ignore this report as we know it's covered. I need to figure out how to run the CUDA tests, too.
Agreed.
You mean locally?
No, adding a bot that has a CUDA device to run the code we develop. That's an action item on me :)
    typename std::enable_if<EnablePadding, bool>::type = true>
    CUDA_HOST_DEVICE void
    execute_with_default_args(list<Rest...>, F f, list<fArgTypes...>, dim3 grid,
                              dim3 block, size_t shared_mem, cudaStream_t stream,
Can we wrap that in a CUDA-specific macro, say CUDA_ARGS, which gets expanded only in CUDA mode? This will allow you to wrap cudaLaunchKernel and friends with #ifdef __CUDACC__ and avoid duplicating the already overly complicated templates...
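A minimal sketch of this macro idea, assuming illustrative names CUDA_ARGS and CUDA_ARGS_FWD rather than Clad's actual API: the extra launch parameters are only spelled out when compiling as CUDA, so the host build keeps a single set of templates.

#ifdef __CUDACC__
#define CUDA_ARGS dim3 grid, dim3 block, size_t shared_mem, cudaStream_t stream,
#define CUDA_ARGS_FWD grid, block, shared_mem, stream,
#else
#define CUDA_ARGS
#define CUDA_ARGS_FWD
#endif

// The same declaration then serves both builds, e.g.:
// template <typename F, typename... Args>
// return_type_t<F> execute_with_default_args(CUDA_ARGS F f, Args&&... args);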
Alternatively, we can go even deeper and assume that in CUDA mode Args has size_t shared_mem, cudaStream_t stream and extract them before forwarding the rest.
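A hedged sketch of this alternative (the function name is illustrative, not the code that was merged): naming size_t shared_mem and cudaStream_t stream as explicit parameters peels them off the front of the pack, so only the kernel's own arguments are forwarded to the launch.

#include <cuda_runtime.h>
#include <utility>

// Assumes the caller passes shared_mem and stream right after the launch geometry.
template <typename F, typename... Rest>
void execute_cuda(F kernel, dim3 grid, dim3 block,
                  size_t shared_mem, cudaStream_t stream, Rest&&... rest) {
  // shared_mem and stream are consumed here; the remaining pack holds only
  // the kernel's own arguments.
  kernel<<<grid, block, shared_mem, stream>>>(std::forward<Rest>(rest)...);
}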
Done
Thank you for this pull-request. Overall, this looks great!
@@ -210,10 +235,34 @@ inline CUDA_HOST_DEVICE unsigned int GetLength(const char* code) {
      printf("CladFunction is invalid\n");
      return static_cast<return_type_t<F>>(return_type_t<F>());
    }
    if (m_CUDAkernel) {
      printf("Use execute_kernel() for global CUDA kernels\n");
We should probably assert-out if users use execute instead of execute_kernel for CUDA kernels. 'printf' seems to be too subtle for reporting such a big error. @vgvassilev What do you think?
Is that a programmatic error or a user error? We can use assert/abort if it is a programmer error.
It is a user error, basically. They should use the appropriate execute function depending on whether their function is a CUDA kernel or not.
    template <typename... Args, class FnType = CladFunctionType>
    typename std::enable_if<!std::is_same<FnType, NoFunction*>::value,
                            return_type_t<F>>::type
    execute_kernel(dim3 grid, dim3 block, size_t shared_mem,
Do you think it would be a good idea to provide overloads for execute_kernel that take default values for the shared_mem and stream args?
Sure, I can use the default values of 0 and nullptr
Done, but I needed to use if constexpr for this, which is C++17, so the CUDA tests cannot run with an older C++ version. @vgvassilev is that too big of a problem?
Why do we need constexpr for having default values for shared_mem and stream?
Since the signature of execute_kernel with default args is dim3, dim3, Args&&..., overload resolution picked that one even when the args should not be defaulted, instead of the correct dim3, dim3, size_t, cudaStream_t, Args&&.... Hence, I kept only the first signature and check whether a cudaStream_t arg is included in Args; otherwise I use the default values when calling execute_with_default_args.
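A hedged sketch of that dispatch (not Clad's actual code; launch_impl stands in for execute_with_default_args): a C++17 fold over the argument types detects a cudaStream_t in the pack, and if constexpr either forwards the caller's configuration or injects shared_mem = 0 and the default stream.

#include <cuda_runtime.h>
#include <type_traits>
#include <utility>

template <typename F, typename... Args>
void launch_impl(F kernel, dim3 grid, dim3 block, size_t shared_mem,
                 cudaStream_t stream, Args&&... args) {
  kernel<<<grid, block, shared_mem, stream>>>(std::forward<Args>(args)...);
}

template <typename F, typename... Args>
void execute_kernel_sketch(F kernel, dim3 grid, dim3 block, Args&&... args) {
  // Compile-time check: did the caller supply a cudaStream_t among the args?
  constexpr bool has_stream =
      (std::is_same<std::decay_t<Args>, cudaStream_t>::value || ...);
  if constexpr (has_stream) {
    // shared_mem and stream are assumed to lead the pack; forward unchanged.
    launch_impl(kernel, grid, block, std::forward<Args>(args)...);
  } else {
    // No stream supplied: fall back to shared_mem = 0 and the default stream.
    launch_impl(kernel, grid, block, /*shared_mem=*/0,
                static_cast<cudaStream_t>(nullptr), std::forward<Args>(args)...);
  }
}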
@vgvassilev I think it's better to not require the grid parameters to be evaluated at compile time. CUDA doesn't require that in general. So, we can't do this:
execute_kernel<grid, block, shared_mem, stream>(...)
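For contrast, a hedged usage sketch of the runtime-configured form; the kernel, sizes, and argument order passed to execute_kernel below are illustrative assumptions, not taken from the PR's tests.

#include "clad/Differentiator/Differentiator.h"

// Illustrative kernel; the adjoint parameters expected by the derived kernel
// depend on the user's signature.
__global__ void add(double* out, const double* in) { /* ... */ }

void run(double* out, const double* in, double* d_out, double* d_in, int n) {
  auto grad = clad::gradient(add);          // assumes the kernel support added here
  dim3 block(256);
  dim3 grid((n + block.x - 1) / block.x);   // computed at run time, as in plain CUDA
  grad.execute_kernel(grid, block, out, in, d_out, d_in);
}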
Looks good to me.
@kchristin22 Can you please rebase the PR on top of master?
This PR closes #1036. CUDA kernels can now be passed as an argument to clad::gradient, and the derived kernels can be executed successfully.
Details on the issues faced and the final approach can be found here: #1036 (comment).
The old PR's approach involved the use of CUDA API libraries to compile and load the computed kernel on the GPU.